The IODS course introduces the basics of data science and touches upon themes of open science and the use of open data. During the course, participants learn basic tools for working with open data in R, as well as the principles of open science.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 12 16:27:45 2022"
Assignment 1
Exercise set 1 is a fairly good crash course in R. It introduces a lot of useful commands that are often needed to tidy data for further analysis. Most of the commands were familiar, but I did not know that one can use pipes when doing analyses. That will be helpful e.g. for running analyses on a subset of the data. As an introduction-level book I prefer Data Analysis with R (the one with the parrot). Having read a couple of introductory-level books on R and data analysis, I think the differences are minor, and any of the ones I have read would suffice for a beginner.
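As a minimal sketch of what piping enables (using the built-in mtcars data purely for illustration, not the course data):

```r
library(dplyr)

# The pipe passes the result on its left as the first argument of the
# function on its right, so subgroup analyses read top-to-bottom.
mtcars %>%
  filter(cyl == 4) %>%            # keep only four-cylinder cars
  summarise(mean_mpg = mean(mpg)) # mean fuel economy in that subgroup
```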
Did the thing with the token; it took some time, but it works. Who would have thought that university laptops would require remote access to install Git.
Link to GitHub diary: https://esoini.github.io/IODS-project/ Repository: https://github.com/esoini/IODS-project
Describe the work you have done this week and summarize your learning.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)
date()
## [1] "Mon Dec 12 16:27:45 2022"
Reading the data
rm(list = ls())
#setwd("/Users/eetusoin/Desktop/IODS")
lrn14<-read.csv("./data/learning2014.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
fig <- ggpairs(lrn14, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
fig
fit <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(fit)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
lm1 <- lm(points ~ attitude + stra, data = lrn14)
summary(lm1)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
lm2 <- lm(points ~ attitude, data = lrn14)
summary(lm2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
1. The data consist of the ASSIST survey (Approaches and Study Skills Inventory for Students); the measures were originally reported in Finnish. The data have 166 observations and 7 variables: gender (male/female), age, exam points, and mean scores on the attitude, deep, strategic (stra) and surface (surf) learning scales.
2. The distributions of attitude, deep, stra, surf and points look fairly normal. The distribution of age is positively skewed (longer tail on the right), as most of the students taking the survey were young adults, and the plot shows some outliers. The strongest correlation is seen between attitude and points (r = .4), and the next strongest associations with points are for stra (r = .146) and surf (r = -.144).
3. Of the three variables used to predict points (attitude, stra and surf had the strongest correlations), only attitude was associated with points (p < .001). The model was fitted again without surf; as stra was still not statistically significant at the .05 alpha level, the final model had only attitude as a predictor.
lm() fits a regression line to the data by ordinary least squares, i.e. by minimizing the sum of squared residuals. The intercept is the predicted points when all the predictors are set to zero. The model parameters were tested with t-tests: a t-test checks whether a coefficient differs from 0, i.e. whether the predictor affects the slope of the regression line. In the final model, an increase of 1 point in attitude changes the predicted exam points by +3.5 points, with a standard error of 0.57.

Multiple R-squared is the proportion of the overall variation that the model explains; it reflects the goodness of fit of the model. When comparing the models above, removing surf and stra from the analyses led to only a minimal change in the adjusted R-squared, so those variables did not explain exam points well. In the last model R-squared was 0.19, so the model with only attitude explained approximately 19 percent of the variation.
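As a minimal sketch of what the final model's coefficients mean in practice (the intercept 11.64 and slope 3.53 are copied from the summary above; the attitude values are hypothetical):

```r
# Predicted exam points = intercept + slope * attitude
b0 <- 11.64  # intercept from the final fitted model above
b1 <- 3.53   # attitude coefficient from the final fitted model above
attitude <- c(2, 3, 4)        # hypothetical attitude scores
round(b0 + b1 * attitude, 1)  # predicted points: 18.7 22.2 25.8
```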
par(mfrow = c(1, 3))
plot(lm2, which = c(1, 2, 5))
4. The residuals vs. fitted values plot shows that most of the residuals are symmetrical around zero, indicating that the model fits the data reasonably well. There seems to be greater variability at some fitted values, which may mean the variance is not constant, so a transformation could be used to account for that.

In the Q-Q plot the observations fit the line, but there are some deviations at the low and high values. This indicates that the association may not be linear; a quadratic model could fit the data better.

The residuals vs. leverage plot shows that some observations are close to Cook's distance, but none exceed it. Hence there is no reason to remove any of the observations.

Overall, normality and constancy of variance seem fine, and there are no observations that would significantly impact the model. Thus we can assume that the assumptions of linear regression were met.
Of the variables in the data, I chose sex, age, quality of family relationships and current health status. The rationale behind the chosen variables is that, as the sexes differ in their vulnerability to internalizing/externalizing problems, I hypothesize that male sex would be a risk factor for high use of alcohol. Adolescents of different ages may differ in the perceived peer pressure to drink alcohol or to blend into the group; hence I hypothesize that older age (i.e. closer to twenty) may also be a risk factor. Lastly, poor family relationships and poor current health status may cause distress in an individual, and alcohol may be used as a coping mechanism to alleviate that distress.
alc1<-alc[c("age","sex","famrel", "health", "high_use")]
fig <- ggpairs(alc1, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
fig
table(alc1$sex,alc1$high_use)
##
## FALSE TRUE
## F 154 41
## M 105 70
g2<-ggplot(alc1, aes(x=high_use, y= famrel,col=sex))
g2+ geom_boxplot()
g3<-ggplot(alc1, aes(x=high_use, y=health,col=sex))
g3+ geom_boxplot()
cor(alc1$high_use, alc1$age)
## [1] 0.105543
Contrary to my hypothesis, health does not seem to be associated with high alcohol use, though many of the participants were in good health, which may mask some of the effect (i.e. a ceiling effect). High alcohol use correlated positively with age, which is in line with my hypothesis. Also in line with the earlier hypothesis, men/boys seem to use alcohol more. Family relationships have a high mean and some outliers at the lower end, so the variable may not be well suited for use as a continuous variable in the analyses (though that is what I intend to do).
#Logistic regression
r1<-glm(high_use~age+sex+health+famrel, data= alc1, family="binomial")
summary(r1)
##
## Call:
## glm(formula = high_use ~ age + sex + health + famrel, family = "binomial",
## data = alc1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4646 -0.8332 -0.6688 1.1524 2.2500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.17523 1.75142 -2.384 0.01713 *
## age 0.22881 0.09991 2.290 0.02201 *
## sexM 0.95044 0.24127 3.939 8.17e-05 ***
## health 0.12605 0.08830 1.427 0.15344
## famrel -0.36626 0.12858 -2.849 0.00439 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 422.29 on 365 degrees of freedom
## AIC: 432.29
##
## Number of Fisher Scoring iterations: 4
Of the chosen variables, male sex was statistically significant at the .001 alpha level, family relationships at the .01 level, and age at the .05 level. Current health status was not associated with high alcohol use. Male sex was associated with a considerable increase (log-odds = 0.95) in high alcohol use, and age with a slight increase (log-odds = 0.23). Good family relationships were associated with lower log-odds of high alcohol use (log-odds = -0.37).
# compute odds ratios (OR)
OR <- coef(r1) %>% exp
# compute confidence intervals (CI)
CI<-confint(r1)%>% exp
## Waiting for profiling to be done...
print(cbind(OR, CI))
## OR 2.5 % 97.5 %
## (Intercept) 0.01537168 0.0004744108 0.4655281
## age 1.25710707 1.0347423229 1.5328666
## sexM 2.58685790 1.6201078828 4.1787499
## health 1.13434220 0.9560769767 1.3527013
## famrel 0.69332056 0.5371929276 0.8909827
Male sex was associated with about 2.6 times higher odds of high alcohol use compared to females, with a 95% confidence interval ranging from 1.6 to 4.2. Family relationships had an odds ratio of 0.7, meaning that a one-unit increase on the family relationship scale decreased the odds of high alcohol use by about 30%. Lastly, a one-year increase in age increased the odds of high alcohol use by a factor of about 1.26, with a 95% CI ranging from 1.03 to 1.5.
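The odds ratios in the table are just the exponentiated log-odds coefficients from the model summary; as a quick check (coefficients copied from the output above):

```r
# exponentiating the log-odds coefficients reproduces the odds ratios
log_odds <- c(sexM = 0.95044, age = 0.22881, famrel = -0.36626)
round(exp(log_odds), 2)  # sexM 2.59, age 1.26, famrel 0.69
```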
# predict() the probability of high_use
probabilities <- predict(r1, type = "response")
# add the predicted probabilities to 'alc'
alc1 <- mutate(alc1, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc1 <- mutate(alc1, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc1$high_use, prediction = alc1$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 246 13
## TRUE 91 20
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc1$high_use, prob = alc1$probability)
## [1] 0.2810811
The model is correct in (246 + 20) / 370 = 72% of the cases, compared to random guessing (50/50). The training error is about 28%.
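The figures above can be checked directly from the confusion matrix (numbers copied from the table):

```r
# accuracy = correctly classified / all observations
conf <- matrix(c(246, 13,
                  91, 20), nrow = 2, byrow = TRUE)
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)      # 0.72
round(1 - accuracy, 2)  # training error, 0.28
```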
Cross validation
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc1, cost = loss_func, glmfit = r1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3
The current model predicts wrongly in about 30% of the cases in cross-validation. The number is higher than for the training data because the model may be overfitted, exploiting random variation in the training data that does not generalize to other data. Cross-validation uses held-out subsets of the data to test generalizability and overfitting.
#Loading the data and exploring the dimensions and structure
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset is part of MASS, and it contains housing values in the suburbs of Boston. It has 506 observations and 14 variables, of which 12 are numeric and 2 are integer.
#Visualizing
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
cors <- cor(Boston) %>% round(2)
corrplot(cors,method = "pie",type= "upper")
summary(cors)
## crim zn indus chas
## Min. :-0.3900 Min. :-0.5700 Min. :-0.7100 Min. :-0.12000
## 1st Qu.:-0.2150 1st Qu.:-0.4050 1st Qu.:-0.3825 1st Qu.:-0.04750
## Median : 0.3200 Median :-0.2550 Median : 0.3950 Median : 0.02000
## Mean : 0.1786 Mean :-0.0550 Mean : 0.1929 Mean : 0.08143
## 3rd Qu.: 0.4500 3rd Qu.: 0.2775 3rd Qu.: 0.6300 3rd Qu.: 0.09000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## nox rm age dis
## Min. :-0.770 Min. :-0.61000 Min. :-0.7500 Min. :-0.7700
## 1st Qu.:-0.360 1st Qu.:-0.29750 1st Qu.:-0.2625 1st Qu.:-0.5225
## Median : 0.305 Median :-0.21500 Median : 0.3050 Median :-0.3050
## Mean : 0.190 Mean :-0.01286 Mean : 0.1736 Mean :-0.1464
## 3rd Qu.: 0.655 3rd Qu.: 0.19000 3rd Qu.: 0.5775 3rd Qu.: 0.2400
## Max. : 1.000 Max. : 1.00000 Max. : 1.0000 Max. : 1.0000
## rad tax ptratio black
## Min. :-0.4900 Min. :-0.5300 Min. :-0.5100 Min. :-0.44000
## 1st Qu.:-0.2850 1st Qu.:-0.3050 1st Qu.:-0.2175 1st Qu.:-0.37750
## Median : 0.4600 Median : 0.4850 Median : 0.2250 Median :-0.22500
## Mean : 0.2371 Mean : 0.2364 Mean : 0.1157 Mean :-0.06071
## 3rd Qu.: 0.6075 3rd Qu.: 0.6475 3rd Qu.: 0.3775 3rd Qu.: 0.16750
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.00000
## lstat medv
## Min. :-0.7400 Min. :-0.74000
## 1st Qu.:-0.4000 1st Qu.:-0.46000
## Median : 0.4150 Median :-0.38000
## Mean : 0.1407 Mean :-0.06857
## 3rd Qu.: 0.5775 3rd Qu.: 0.31000
## Max. : 1.0000 Max. : 1.00000
#par(mfrow=c(4, 4))
colnames <- dimnames(Boston)[[2]]
for (i in seq_along(colnames)) {
  hist(Boston[, i], main = colnames[i], probability = TRUE, col = "pink", border = "white")
}
The distributions of rm and lstat seem approximately normal; age, dis and medv seem skewed. indus and tax seem to have outliers or otherwise high variability in the observed values, as do black, crim and chas (when comparing the histograms with the printed summary).
The strongest negative correlations involve distance to employment centres (dis) with industry (indus), nitrogen oxide concentration (nox) and age (proportion of units built before 1940), all stronger than -.7. Particularly interesting are the weak correlations of chas with all the other variables. Of the positive correlations, the strongest are between tax and rad (tax rate and access to radial highways), indus and nox, and age and nox.
#scaling the data
library(dplyr)
bs <- as.data.frame(scale(Boston))
summary(bs)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
#crime as factor using quantiles
q<-quantile(bs$crim)
bs$crime <- cut(bs$crim, breaks = q, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
#removing original crim variable
bs<- dplyr::select(bs, -crim)
Standardizing the data centres each variable around 0 and scales it so that its standard deviation is 1. Note that this does not make the data normally distributed; it only puts the variables on a comparable scale, which is needed for distance-based methods.
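A minimal sketch of what scale() does, verified by hand (using the Boston crim variable just as an example):

```r
library(MASS)
data("Boston")

# standardize by hand: subtract the mean, divide by the standard deviation
x <- Boston$crim
z <- (x - mean(x)) / sd(x)

# scale() does the same thing column-wise
all.equal(z, as.numeric(scale(x)))  # TRUE
round(c(mean(z), sd(z)), 10)        # 0 1
```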
#Fitting lda
#copy-pasting the code for training set
n <- nrow(bs)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- bs[ind,]
# create test set
test <- bs[-ind,]
#fitting the lda
eka<-lda(crime~., data= train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
eka
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2698020 0.2549505 0.2351485
##
## Group means:
## zn indus chas nox rm age
## low 0.81167183 -0.9206005 -0.06938576 -0.8507568 0.4341715 -0.8185735
## med_low -0.07400687 -0.3254960 -0.05560795 -0.5841229 -0.1286559 -0.3626409
## med_high -0.36901585 0.1616886 0.14813795 0.3802477 0.2090196 0.4458561
## high -0.48724019 1.0172655 -0.02367011 1.0552843 -0.4741261 0.7995404
## dis rad tax ptratio black lstat
## low 0.7991229 -0.6906105 -0.7364161 -0.42612800 0.3854350 -0.75305321
## med_low 0.3617409 -0.5446108 -0.4919750 -0.06548531 0.3157669 -0.18713828
## med_high -0.3863683 -0.4176728 -0.3118455 -0.36647466 0.1277994 -0.09906541
## high -0.8593528 1.6366336 1.5129868 0.77903654 -0.8399980 0.89239127
## medv
## low 0.51710336
## med_low 0.01947573
## med_high 0.27448588
## high -0.68249927
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10237909 0.596918661 -1.06752211
## indus 0.04311230 -0.331028056 0.48223613
## chas -0.02622842 -0.002121352 0.02685402
## nox 0.46407532 -0.965253570 -1.19657559
## rm 0.03735499 -0.188860891 -0.28082476
## age 0.23297991 -0.437789978 -0.13455727
## dis -0.08509867 -0.383114418 0.23296421
## rad 3.33582480 0.818208367 -0.01991887
## tax 0.02407188 0.136833881 0.20432316
## ptratio 0.13498305 0.024654329 -0.28459343
## black -0.12231523 0.023483289 0.08825579
## lstat 0.14546144 -0.034978551 0.49015104
## medv 0.08077687 -0.376207161 -0.01231084
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9522 0.0357 0.0121
classes <- as.numeric(train$crime)
plot(eka, dimen = 2,col=classes, pch=classes)
lda.arrows(eka, myscale = 2)
Predicting the correct classes
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
pred <- predict(eka, newdata = test)
table(correct=correct_classes, predicted=pred$class)
## predicted
## correct low med_low med_high high
## low 20 9 1 0
## med_low 1 12 4 0
## med_high 1 13 7 2
## high 0 0 0 32
The LDA predicts the "high" class well, with few to no misclassifications. There seems to be ambiguity between the lower levels of the crime variable.
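The overall test-set accuracy can be read off the cross-tabulation (numbers copied from the table above; they depend on the random train/test split):

```r
# diagonal = correctly predicted classes
conf <- matrix(c(20,  9, 1,  0,
                  1, 12, 4,  0,
                  1, 13, 7,  2,
                  0,  0, 0, 32), nrow = 4, byrow = TRUE)
round(sum(diag(conf)) / sum(conf), 2)  # about 0.7
```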
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run the k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
data("Boston")
#standardizing the data
dat <- as.data.frame(scale(Boston))
#Distances
dist_eu <- dist(dat)
#setting seed for reproducibility
set.seed(666)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dat, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(dat, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[6:10], col = km$cluster)
2 seems to be the optimal number of clusters, as the total within-cluster sum of squares drops sharply at k = 2 and flattens afterwards. Some variables seem to differentiate the groups better, for example tax and rad, but for most variables there seems to be a lot of overlap.
Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises).
set.seed(999)
km2<-kmeans(dat,centers = 4)
targ<-km2$cluster
#variable chas appeared to be nearly constant across the clusters, thus it was removed
lda2<-lda(targ~crim+zn+age+dis+rad+tax+ptratio+black+lstat+medv+nox+indus+rm, data = dat)
# target classes as numeric
classes <- as.numeric(targ)
# plot the lda results
plot(lda2, dimen = 2,col= classes)
lda.arrows(lda2, myscale = 1)
pairs(dat[6:10],col= targ)
Access to radial highways (rad) seems to influence the separation between the clusters the most.
Loading the data and showing summaries and a graphical overview
library(dplyr)
library(GGally)
dat<-read.csv("./data/human2.csv")
row.names(dat)<-dat$X
dat<-dat[,2:9]
summary(dat)
## sedu_rat lab_rat edu liexp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni mat_mor abr repre_parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(dat)
Data has 155 observations and 8 variables.
VARIABLES: country = country; sedu_rat = ratio of females with at least secondary education to males with at least secondary education; lab_rat = proportion of females in the work force divided by the proportion of males in the work force; edu = expected years of schooling; liexp = life expectancy at birth; gni = Gross National Income per capita; mat_mor = maternal mortality ratio; abr = adolescent birth rate; repre_parl = percentage of female representatives in parliament.
Most of the variables seem to be normally distributed. abr seems to be skewed to the right, as is mat_mor; gni is heavily skewed to the right. liexp seems somewhat skewed to the left (its median, 74.2, lies closer to the maximum than the minimum).
Principal component analysis (PCA) on the raw (non-standardized) human data and bi-plot
A<-prcomp(dat)
summary(A)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
s<-summary(A)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr2<-round(pca_pr*100,digits = 2)
pca_lab<-paste0(names(pca_pr), " (", pca_pr2, "%)")
biplot(A, choices = 1:2, cex = c(0.8, 1), col = c("grey70", "deeppink2"), xlab = pca_lab[1], ylab = pca_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
PCA on standardized data
dat1<-scale(dat)
dat1<-as.data.frame(dat1)
B<-prcomp(dat1)
summary(B)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
s<-summary(B)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr2<-round(pca_pr*100,digits = 2)
pca_lab<-paste0(names(pca_pr), " (", pca_pr2, "%)")
biplot(B, choices = 1:2, cex = c(0.8, 1), col = c("grey70", "deeppink2"), xlab = pca_lab[1], ylab = pca_lab[2])
Both the plot and the explained variance changed drastically. PCA seeks directions of maximal variance; without standardizing, the first component is dominated by the variable with the largest variance. After standardizing it is obvious that other components also contribute, but only the first two components each explain more than 10% of the variance.

In the first plot, only gni (which happens to have by far the largest variance) seems to explain the variance, whereas in the second plot maternal mortality, adolescent birth rate, education, life expectancy and the secondary-education ratio load on the first component, and the labour-force ratio and female representation in parliament load on the second component.
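With standardized variables, the proportions of variance in the summary follow directly from the component standard deviations: each variable contributes variance 1, so the total variance equals the number of variables (8 here). Using PC1's standard deviation copied from the summary above:

```r
# proportion of variance explained by PC1 = var(PC1) / total variance
sd_pc1 <- 2.0708        # standard deviation of PC1 (from the summary)
round(sd_pc1^2 / 8, 4)  # about 0.536, matching the printed proportion
```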
Interpretations of the obtained results
It seems that the first component relates to welfare, and the second component more to equality of the sexes. When looking at the countries in the plot, most western states are located on the left, whereas less wealthy countries are located on the right. Looking at the second component, countries in which the status of women is not equal to that of men are located at the bottom of the plot. Interestingly, some African countries are at the top of the plot. It is worth noting that there may be some interaction between the components chosen by PCA: in wealthier countries it may be possible for one sex not to work, whereas in poorer countries both sexes may have to work.
The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).
Load the tea dataset and convert its character variables to factors:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
library(dplyr)
library(tidyr)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
tea1 <- tea[, 1:10]
ggpairs(tea1)
pivot_longer(tea1, cols = everything()) %>%
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Exploring the data briefly.
library(FactoMineR)
mca <- MCA(tea1, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea1, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.179 0.129 0.115 0.107 0.104 0.090 0.079
## % of var. 17.939 12.871 11.481 10.654 10.406 8.985 7.917
## Cumulative % of var. 17.939 30.810 42.291 52.945 63.351 72.337 80.253
## Dim.8 Dim.9 Dim.10
## Variance 0.076 0.068 0.054
## % of var. 7.559 6.793 5.395
## Cumulative % of var. 87.812 94.605 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.543 0.549 0.474 | -0.462 0.552 0.342 | -0.007 0.000
## 2 | -0.543 0.549 0.474 | -0.462 0.552 0.342 | -0.007 0.000
## 3 | -0.068 0.009 0.002 | 0.542 0.761 0.141 | -0.139 0.056
## 4 | -1.037 1.998 0.558 | 0.264 0.180 0.036 | -0.018 0.001
## 5 | -0.235 0.103 0.061 | -0.011 0.000 0.000 | 0.124 0.044
## 6 | -1.037 1.998 0.558 | 0.264 0.180 0.036 | -0.018 0.001
## 7 | 0.024 0.001 0.001 | -0.404 0.422 0.374 | -0.185 0.099
## 8 | -0.193 0.069 0.054 | 0.058 0.009 0.005 | -0.382 0.423
## 9 | -0.258 0.124 0.117 | -0.624 1.007 0.680 | -0.130 0.049
## 10 | -0.048 0.004 0.002 | -0.153 0.061 0.020 | -0.176 0.090
## cos2
## 1 0.000 |
## 2 0.000 |
## 3 0.009 |
## 4 0.000 |
## 5 0.017 |
## 6 0.000 |
## 7 0.078 |
## 8 0.210 |
## 9 0.030 |
## 10 0.027 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | 0.209 1.168 0.040 3.471 | -0.770 22.090 0.547
## Not.breakfast | -0.193 1.078 0.040 -3.471 | 0.710 20.391 0.547
## Not.tea time | -0.680 11.253 0.358 -10.351 | 0.327 3.635 0.083
## tea time | 0.527 8.723 0.358 10.351 | -0.254 2.817 0.083
## evening | 0.445 3.787 0.103 5.562 | 0.634 10.728 0.210
## Not.evening | -0.233 1.980 0.103 -5.562 | -0.332 5.609 0.210
## lunch | 0.947 7.329 0.154 6.788 | 0.167 0.317 0.005
## Not.lunch | -0.163 1.260 0.154 -6.788 | -0.029 0.054 0.005
## dinner | -1.571 9.633 0.186 -7.454 | 1.044 5.925 0.082
## Not.dinner | 0.118 0.725 0.186 7.454 | -0.079 0.446 0.082
## v.test Dim.3 ctr cos2 v.test
## breakfast -12.786 | -0.031 0.041 0.001 -0.518 |
## Not.breakfast 12.786 | 0.029 0.038 0.001 0.518 |
## Not.tea time 4.983 | 0.235 2.102 0.043 3.579 |
## tea time -4.983 | -0.182 1.630 0.043 -3.579 |
## evening 7.929 | -0.599 10.742 0.188 -7.494 |
## Not.evening -7.929 | 0.313 5.616 0.188 7.494 |
## lunch 1.195 | -1.133 16.410 0.221 -8.125 |
## Not.lunch -1.195 | 0.195 2.820 0.221 8.125 |
## dinner 4.951 | -0.090 0.050 0.001 -0.428 |
## Not.dinner -4.951 | 0.007 0.004 0.001 0.428 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.040 0.547 0.001 |
## tea.time | 0.358 0.083 0.043 |
## evening | 0.103 0.210 0.188 |
## lunch | 0.154 0.005 0.221 |
## dinner | 0.186 0.082 0.001 |
## always | 0.089 0.096 0.414 |
## home | 0.008 0.114 0.005 |
## work | 0.215 0.006 0.251 |
## tearoom | 0.315 0.003 0.018 |
## friends | 0.325 0.141 0.008 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",habillage = "quali")
According to the plot, having tea at dinner and having tea somewhere other than home have the greatest impact on the first two dimensions. Variables close to the horizontal x-axis have little effect on dimension 2, and variables close to the y-axis have little effect on dimension 1. Most of the "Not" categories lie close to the origin, while their counterparts are located further away, meaning the latter contribute more to the two dimensions.
Dimension 2 seems to reflect the social aspect of drinking tea: categories related to the social side of tea drinking drive the y-values in the plot. Dimension 1 is harder to interpret, but one possibility is that it reflects a "tea as a hobby" aspect, with categories such as drinking tea in a tearoom and having tea time.
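These visual impressions can also be checked numerically: the MCA object stores the contribution (in %) of each category to each dimension. A minimal sketch, assuming the FactoMineR package (which ships the same tea data) is installed:

```r
# Sketch: inspect category contributions to the MCA dimensions numerically,
# instead of judging distances from the origin by eye.
library(FactoMineR)

data(tea)                              # FactoMineR ships this data set
mca <- MCA(tea[, 1:10], graph = FALSE)

# Contributions (in %) of each category to each dimension; the categories
# with the largest values on Dim 1 and Dim 2 are the ones driving the plot.
contrib <- mca$var$contrib
head(contrib[order(contrib[, 1], decreasing = TRUE), 1:2])
```

With 10 two-level variables this table has 20 rows, one per category, matching the categories listed in the summary above.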
Loading the data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.7 ✔ stringr 1.4.0
## ✔ readr 2.1.3 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
dat1<-read.csv("./data/BPRS.csv")
dat2<-read.csv("./data/rats.csv")
dat2og<-read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",sep = "\t",header = T)
#changing to factors
dat1<-dat1 %>% mutate(treatment=as.factor(treatment)) %>%
mutate(subject= as.factor(subject)) %>%
mutate(id=as.factor(id))
dat2<-dat2 %>% mutate(ID=as.factor(ID)) %>%
mutate(Group=as.factor(Group))
#dat1 <- dat1 %>% mutate(week = as.integer(substr(weeks,5,5)))
#dat2 <- dat2 %>% mutate(time = as.integer(substr(time,3,3)))
Dat2 contains the rats data: body weights of individual rats in three diet groups, measured repeatedly over the study period.
First I will visualize the data.
dat2 <- dat2 %>%
group_by(time1) %>%
mutate( stdweight = (weight - mean(weight))/sd(weight) ) %>%
ungroup()
# Plot again with the standardised weight
library(ggplot2)
ggplot(dat2, aes(x = time1, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
There seems to be a noticeable difference between group 1 and the other groups.
Note, however, that groups 2 and 3 have only 4 and 3 participants,
respectively.
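The per-timepoint standardization used in the plot above can be verified on a small toy example (made-up weights, not the rats data): within each time point, weights are centred and scaled to standard deviation 1.

```r
# Toy check of the standardization used above: within each time point,
# subtract the mean and divide by the standard deviation.
# The weights below are made up for illustration.
library(dplyr)

toy <- data.frame(
  time1  = rep(c(1, 2), each = 3),
  weight = c(240, 250, 260, 300, 310, 320)
)

toy <- toy %>%
  group_by(time1) %>%
  mutate(stdweight = (weight - mean(weight)) / sd(weight)) %>%
  ungroup()

# After this, each time point has mean 0 and sd 1, so growth profiles
# are compared relative to the group at that time point.
```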
dat22 <- dat2 %>%
filter(time1>1) %>%
group_by(Group, time1) %>%
summarise( mean = mean(weight), se = sd(weight)/sqrt(16) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(dat22, aes(x = time1, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
dat22 <- dat2 %>%
filter(time1>1) %>%
group_by(Group, ID) %>%
summarise( mean = mean(weight), se = sd(weight)/sqrt(16) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(dat22, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), time 8-64")
Time doesn't seem to affect weight much, and the values at baseline predict the
later course pretty well.
Let's conduct an ANOVA to test the differences, although according to the plots groups 2 and 3 differ from group 1.
dat3 <- dat2 %>%
filter(time1 > 1) %>%
group_by(Group,ID) %>%
summarise( mean=mean(weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
dat3 <- dat3 %>%
mutate(baseline = dat2og$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean~baseline+Group, data = dat3)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The group effect is close to statistical significance (p = .076). Baseline seems to predict the mean well, and the groups did not differ significantly from each other.
I will use a contrast to check whether group 1 differs from the other groups.
dat4<-dat3 %>% mutate(group1=ifelse(Group==1,1,2))
fit2 <- lm(mean~baseline+group1, data = dat4)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit2)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1806.5913 2.448e-15 ***
## group1 1 690 690 4.9159 0.04505 *
## Residuals 13 1825 140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we combine group 2 and 3, they seem to differ from group 1 at alpha-level .05. As the analysis was done ad-hoc, it should not be interpreted as hypothesis testing, but rather as exploring the data.
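The pooling trick above collapses groups 2 and 3 into one level via `ifelse`. With only two resulting levels, the numeric 1/2 dummy and a proper two-level factor give the same test, which can be illustrated with toy data (made-up numbers, not the rats data):

```r
# Sketch: collapsing several groups into a binary contrast.
# Toy data with made-up group means, not the rats data.
set.seed(1)
toy <- data.frame(
  Group = factor(rep(c(1, 2, 3), each = 5)),
  mean  = c(rnorm(5, 260), rnorm(5, 440), rnorm(5, 530))
)

# Pool groups 2 and 3 against group 1, as in the analysis above.
toy$group1 <- ifelse(toy$Group == 1, 1, 2)

# For a binary predictor, the numeric 1/2 dummy and a two-level factor
# coding produce identical F-tests.
fit_num <- lm(mean ~ group1, data = toy)
fit_fac <- lm(mean ~ factor(group1), data = toy)
anova(fit_num)$F[1] == anova(fit_fac)$F[1]
```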
## Task 2
Dat1 contains data on 40 male subjects who were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment (week 0), and follow-up data was collected weekly for 8 weeks. The BPRS is used to evaluate patients suspected of having schizophrenia.
Standardizing the data and drawing plots
dat1 <- dat1 %>%
group_by(week) %>%
mutate( stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
ungroup()
# Plot again with the standardised bprs
library(ggplot2)
ggplot(dat1, aes(x = week, y = stdbprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(name = "standardized bprs")
dat11<-dat1%>% arrange(week)
ggplot(dat11, aes(x = week, y = stdbprs, group = id)) +
geom_line(aes(linetype=treatment))+
scale_y_continuous(name = "bprs")+
theme(legend.position = "top")
Standardized bprs scores do not seem to differ greatly between groups.
## Regression
Next I fit regressions to the data in a stepwise manner. First, a regular linear regression that ignores the fact that the data contain multiple observations from each participant; in other words, it wrongly assumes that the observations are independent of each other rather than clustered within participants. Next, I fit a model with a random intercept to account for the correlation within individuals, so that the intercept is allowed to vary between participants. Lastly in this section, I fit a random intercept and random slope model, in which the slope, in addition to the intercept, is allowed to vary by participant.
# fit a regular linear regression (ignores the clustering within subjects)
reg <- lm(bprs ~ week + treatment, data = dat1)
# print out a summary of the model
summary(reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
ref <- lmer(bprs ~ week + treatment + (1 | subject), data = dat1, REML = FALSE)
# Print the summary of the model
summary(ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = dat1, REML = FALSE)
# print a summary of the model
summary(ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(ref1, ref)
## Data: dat1
## Models:
## ref: bprs ~ week + treatment + (1 | subject)
## ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## ref 5 2748.7 2768.1 -1369.4 2738.7
## ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The random intercept reduces the error of the model compared to the regular regression. According to the ANOVA (a likelihood ratio test), the model with both a random intercept and a random slope fits the data better at alpha level .05: ref1 explains more of the variation in the data than ref.
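The chi-squared statistic in the ANOVA table is simply the difference in deviances between the two models, with degrees of freedom equal to the difference in their parameter counts. The p-value above can be reproduced by hand from the deviances printed in the summaries:

```r
# Reproduce the likelihood ratio test from the anova() output by hand.
# Deviances and parameter counts are taken from the model summaries above.
dev_ref  <- 2738.7   # random intercept model, 5 parameters
dev_ref1 <- 2731.4   # random intercept + slope model, 7 parameters

chisq <- dev_ref - dev_ref1   # ~7.3 (anova reports 7.2721 from
                              # unrounded deviances)
p <- pchisq(chisq, df = 7 - 5, lower.tail = FALSE)
round(p, 3)                   # ~0.026, matching Pr(>Chisq) above
```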
Lastly, I will fit a model that allows an interaction between week and treatment.
ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week*treatment), data = dat1, REML = FALSE)
summary(ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0513 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0048 8.0626
## week 0.9687 0.9842 -0.51
## Residual 96.4702 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(ref2,ref1)
## Data: dat1
## Models:
## ref1: bprs ~ week + treatment + (week | subject)
## ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## ref1 7 2745.4 2772.6 -1365.7 2731.4
## ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fitted <- fitted(ref1)
dat1$fitted<-Fitted
# draw the plot of dat1 with the fitted values of bprs
ggplot(dat1, aes(x = week, y = fitted, group= id)) +
geom_line(aes(linetype=treatment))+
scale_x_continuous(name = "week", breaks = seq(0, 10, 2)) +
scale_y_continuous(name = "Fitted bprs") +
theme(legend.position = "top")
The model that allows an interaction between week and treatment does not improve the fit at alpha level .05. Looking at the results from ref1 (the model with random intercepts and random slopes), the two groups do not seem to differ from each other. We don't know whether the other group was a control group or whether two different treatments were compared, so all we can say is that the treatments do not differ in their effect on BPRS. All participants showed a roughly linear trend, with later weeks associated with lower BPRS scores.